Record Extraction Using Record Segmentation Tree
Abstract
Despite extensive study of information extraction from web pages, existing methods fail to extract all the data from a page. They also split data extraction into two phases: record region detection and record segmentation. In this paper, we propose a unified method for data extraction from structured web pages. We introduce a new search structure, the Record Segmentation Tree (RST), together with a few search-pruning techniques on the RST that make extraction faster and more efficient. The method can handle more complicated web pages because it uses a token-based edit distance instead of string or tree edit distances. The partial tree alignment method is then used to align the extracted data into a more understandable form. Experiments on the data sets used by several existing methods show that our method gives more efficient results than those methods.
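To make the idea of a token-based edit distance concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not specify the tokenization or the cost model, so the `tokenize` rule (tag names kept, attributes dropped, all text collapsed to a `TEXT` token), the unit edit costs, and the names `tokenize`, `token_edit_distance`, and `similarity` are all assumptions made for illustration.

```python
import re


def tokenize(html):
    """Split an HTML fragment into tag tokens and a generic TEXT token.

    Assumption: attributes are dropped and every text run becomes "TEXT".
    """
    tokens = []
    for part in re.findall(r"<[^>]+>|[^<]+", html):
        if part.startswith("<"):
            # Keep only the tag name (with '/' for closing tags).
            m = re.match(r"<\s*(/?\s*\w+)", part)
            if m:
                tokens.append("<" + m.group(1).replace(" ", "").lower() + ">")
            else:
                tokens.append(part)
        elif part.strip():
            tokens.append("TEXT")
    return tokens


def token_edit_distance(a, b):
    """Levenshtein distance where the unit of comparison is a token."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        curr = [i]
        for j, tb in enumerate(b, 1):
            cost = 0 if ta == tb else 1
            curr.append(min(prev[j] + 1,          # delete ta
                            curr[j - 1] + 1,      # insert tb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalize the distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - token_edit_distance(a, b) / longest


record_1 = "<li><b>Book A</b> <i>$10</i></li>"
record_2 = "<li><b>Book B</b> <i>$12</i> <span>new</span></li>"
print(token_edit_distance(tokenize(record_1), tokenize(record_2)))
```

Because text content is collapsed to a single token in this sketch, two records with the same markup but different values compare as identical, while an optional field shows up as a small number of token insertions. That is one plausible reason a token-level comparison tolerates irregular records better than a character-level string distance.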
Similar resources
Information discovery from semi-structured record sets on the Web
The World Wide Web has been extensively developed since its first appearance two decades ago. Various applications on the Web have changed human life in unprecedented ways. Although the explosive growth and spread of the Web have resulted in a huge information repository, it is still under-utilized due to the difficulty of automated information extraction (IE) caused by the heterogeneity of Web...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
First distribution record of regular echinoids (Echinodermata; Echinoidea) from Chennai Coast, South India
Regular echinoids were recorded from the Chennai Coast, Tamilnadu, South India; the animals belong to 4 families, 5 genera and 5 species. An identification key to generic level and a synoptic description are provided. The temnopleurid sea urchin Salmaciella oligopora (Clark, 1916) was recorded for the first time at 20-30 m depth between the Chennai and Pondicherry Coasts, South East Coast of India....
Semi-structured Information Extraction Applying Automatic Pattern Discovery
Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction, for example WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to automatically generate extractors. However, this approach still requires human intervention to provide training examples. He...
Intrathoracic Airway Tree Segmentation from CT Images Using a Fuzzy Connectivity Method
Introduction: Virtual bronchoscopy is a reliable and efficient diagnostic method for primary symptoms of lung cancer. The segmentation of airways from CT images is a critical step for numerous virtual bronchoscopy applications. Materials and Methods: To overcome the limitations of the fuzzy connectedness method, the proposed technique, called fuzzy connectivity - fuzzy C-mean (FC-FCM), utilized...